TGAGA annotation v0.1.0

more conservative v0.1.0 annotation

aquacul4

running

cg010_blastn

Will also attempt to clean up the first BLAST output (~58k hits)

> removing hits with alignment length < 100

> keeping e-value < 1E-10 (removing weaker hits)

> sorted by query, then e-value; then removed duplicates on the query and query-start columns (reduced to 5786 hits)

> repeated with the query-end column (reduced to 3891 hits)

Then going with annotation on Galaxy

Galaxy52-[Join_two_Datasets_on_data_32_and_data_50].tabular

new GFF

Annotation1_cg_v010.gff

The Blast2Gff command looks something like this:

./Blast2Gff.pl -i /Volumes/Bay3\ scratch/gff_fun/7 -o /Volumes/Bay3\ scratch/gff_fun/Combined_fosmids_cd_hit_mod_20000_7trim.gff -d "sigenae_v8" -p EXON -s "something"

Flipping it around: taking sigenae v8 and blasting it against tgagag v0.1.0 on Server (SW) using -G 1 and -E 1

Will modify the Blast2Gff script to try to pull out the relevant information.

BLAST COMPLETE

output: http://aquacul4.fish.washington.edu/~steven/filefish/sigena8_blast_v010.txt

7.4 Million lines

Will use Galaxy to filter…

From original: align length > 100

;; down to 43,061 lines

From original: evalue < 0.01

;; down to 110,000 lines

From original: evalue < 0.0001

;; down to 65,842 lines

From there, trimming to align length > 100

;; now at 37,274 lines

Galaxy57-[Filter_on_data_56].tabular

---------

Back to original

trimming on %ID

c3>=95

about a million lines

---

Now will filter the 37,274-line file (evalue, align length) with

c3>=95

;; down to 1,105 lines

Galaxy60-[Filter_on_data_57].tabular
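
For reference, the same chain of Galaxy filters (evalue < 0.0001, then align length > 100, then c3>=95) could be reproduced outside Galaxy with something like the sketch below. The thresholds come from the notes above; the column positions assume the standard 12-column BLAST tabular layout (c3 = %ID, c4 = alignment length, c11 = e-value), and `filter_stages` is a hypothetical name.

```python
def filter_stages(lines, max_evalue=1e-4, min_length=100, min_pident=95.0):
    """Apply the three filters in order and count survivors after each stage.

    `lines` can be any iterable of tab-separated lines, e.g. an open file,
    so a multi-million-line BLAST output is never held in memory at once.
    """
    counts = {"input": 0, "evalue": 0, "length": 0, "pident": 0}
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or fields[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        counts["input"] += 1
        if float(fields[10]) >= max_evalue:   # c11: keep evalue < 0.0001
            continue
        counts["evalue"] += 1
        if int(fields[3]) <= min_length:      # c4: keep align length > 100
            continue
        counts["length"] += 1
        if float(fields[2]) < min_pident:     # c3: keep %ID >= 95
            continue
        counts["pident"] += 1
        kept.append(line)
    return kept, counts
```

The per-stage counts make it easy to compare against the line counts Galaxy reports after each filter.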

NEW GFF file

sig8_blast_v010_flp.gff

also known as Annotation2_cg_v010

NOTE: column 9 needs to include "name="

final-Annotation2_cg_v010.gff
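
A minimal sketch of how one BLAST hit might be written as a GFF line with "name=" in column 9. This is not the Blast2Gff.pl logic. It assumes 12-column tabular input with the sigenae sequence as query and the genome contig as subject, so the feature is placed on the subject coordinates; the `source` and `feature` defaults are borrowed from the command shown earlier purely for illustration, and `hit_to_gff` is a hypothetical name.

```python
def hit_to_gff(fields, source="sigenae_v8", feature="EXON"):
    """Turn one 12-column BLAST tabular hit into a 9-column GFF line.

    Assumption: the subject (column 2) is the genome contig being annotated
    and the query (column 1) supplies the name shown in column 9.
    """
    qseqid, sseqid = fields[0], fields[1]
    sstart, send = int(fields[8]), int(fields[9])
    evalue = fields[10]
    # GFF needs start <= end; a reversed subject range means a minus-strand hit.
    start, end = min(sstart, send), max(sstart, send)
    strand = "+" if sstart <= send else "-"
    attributes = "name=" + qseqid
    return "\t".join([sseqid, source, feature, str(start), str(end),
                      evalue, strand, ".", attributes])
```

Putting the e-value in the score column is a choice made here for the sketch; the bit score (column 12) would work just as well.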

running an MBD ref map on it.

IDEA: need to get known gene structure and validate an approach.